Semi-supervised Learning from Unbalanced Labeled Data - An Improvement
Abstract
We present a substantial improvement for semi-supervised learning from training data sets in which only a small fraction of the data pairs is labeled. In particular, we propose a novel decision strategy based on normalized model outputs, and we explain why the normalization step helps. The paper compares the performance of two popular semi-supervised approaches (the Consistency Method and the Harmonic Gaussian Model) on unbalanced and balanced labeled data, with and without normalization of the models' outputs. Experiments on text categorization problems suggest significant improvements in classification performance for models that use normalized outputs as the basis for the final decision.
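To make the idea of deciding on normalized outputs concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Consistency Method in its usual closed form, F = (I - alpha S)^(-1) Y with a symmetrically normalized RBF affinity matrix S, and it assumes that "normalization" means standardizing each class column of F to zero mean and unit variance before taking the argmax; the abstract does not specify the exact scheme. The function names `consistency_method` and `decide`, and the parameters `alpha` and `sigma`, are hypothetical choices for illustration.

```python
import numpy as np

def consistency_method(X, y, n_classes, alpha=0.99, sigma=1.0):
    """Closed-form Consistency Method scores (sketch).

    X: (n, d) data matrix; y: (n,) integer labels, -1 marks unlabeled points.
    Returns F of shape (n, n_classes), one score column per class.
    """
    n = X.shape[0]
    # RBF affinities with a zero diagonal (no self-similarity).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One-hot label matrix; rows of unlabeled points stay all-zero.
    Y = np.zeros((n, n_classes))
    idx = np.flatnonzero(y >= 0)
    Y[idx, y[idx]] = 1.0
    # Solve (I - alpha * S) F = Y.
    return np.linalg.solve(np.eye(n) - alpha * S, Y)

def decide(F, normalize=True):
    """Pick a class per row, optionally after per-column standardization.

    Assumption: normalization = zero mean, unit variance per class column.
    """
    if normalize:
        std = F.std(axis=0)
        std[std == 0] = 1.0  # guard against constant columns
        F = (F - F.mean(axis=0)) / std
    return F.argmax(axis=1)

# Toy scores where class 0 dominates in magnitude, as can happen when
# class 0 has many more labeled points than class 1.
F_toy = np.array([[10.0, 1.0], [8.0, 2.0], [9.0, 0.5], [7.0, 3.0]])
raw_decision = decide(F_toy, normalize=False)   # every row picks class 0
norm_decision = decide(F_toy, normalize=True)   # rows 1 and 3 switch to class 1
```

In the toy matrix, the raw argmax assigns every point to the class with the larger scores, while the standardized scores compare each point against the typical level of its own column, so the points that are relatively strongest in the second column are assigned to it.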
Similar resources
Detecting Concept Drift in Data Stream Using Semi-Supervised Classification
Data stream is a sequence of data generated from various information sources at a high speed and high volume. Classifying data streams faces the three challenges of unlimited length, online processing, and concept drift. In related research, to meet the challenge of unlimited stream length, commonly the stream is divided into fixed size windows or gradual forgetting is used. Concept drift refer...
Semi-Supervised Never-Ending Learning in Rhetorical Relation Identification
Some languages do not have enough labeled data to obtain good discourse parsing, especially in the relation identification step, and the additional use of unlabeled data is a plausible solution. A workflow is presented that uses a semi-supervised learning approach. Instead of only a predefined additional set of unlabeled data, texts obtained from the web are continuously added. This obtains near...
Filling the Gap: Semi-Supervised Learning for Opinion Detection Across Domains
We investigate the use of Semi-Supervised Learning (SSL) in opinion detection both in sparse data situations and for domain adaptation. We show that co-training reaches the best results in an in-domain setting with small labeled data sets, with a maximum absolute gain of 33.5%. For domain transfer, we show that self-training gains an absolute improvement in labeling accuracy for blog data of 16...
Model Selection for Semi-Supervised Learning with Limited Labeled Data
An important component for making semi-supervised learning applicable to real world data is the task of model selection. For the case of very limited labeled data, for which semi-supervised learning algorithms have the greatest potential to offer improvement in estimating predictive models, model selection is a significant challenge, a key open problem, and often avoided entirely in previous wo...
On semi-supervised learning of Gaussian Mixture Models for phonetic classification
This paper investigates semi-supervised learning of Gaussian mixture models using a unified objective function taking both labeled and unlabeled data into account. Two methods are compared in this work: the hybrid discriminative/generative method and the purely generative method. They differ in the criterion type on labeled data; the hybrid method uses the class posterior probabilities and th...